46 research outputs found

Generative and Discriminative Text Classification with Recurrent Neural Networks

Author: Blunsom Phil
Dyer Chris
Ling Wang
Yogatama Dani
Publication date: 01/01/2017

We empirically characterize the performance of discriminative and generative LSTM models for text classification. We find that although RNN-based generative models are more powerful than their bag-of-words ancestors (e.g., they account for conditional dependencies across words in a document), they have higher asymptotic error rates than discriminatively trained RNN models. However, we also find that generative models approach their asymptotic error rate more rapidly than their discriminative counterparts, the same pattern that Ng & Jordan (2001) proved holds for linear classification models that make more naive conditional independence assumptions. Building on this finding, we hypothesize that RNN-based generative classification models will be more robust to shifts in the data distribution. This hypothesis is confirmed in a series of experiments in zero-shot and continual learning settings that show that generative models substantially outperform discriminative models.
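To make the generative side of this comparison concrete, here is a minimal sketch of classification by Bayes' rule: pick the label y that maximizes log p(y) + log p(x | y). It is not the paper's model. An add-one-smoothed unigram model per class stands in for the class-conditional LSTM language model, and the names `train_generative`, `log_joint`, and `classify` are hypothetical.

```python
import math
from collections import Counter


def train_generative(docs, labels):
    """Fit one unigram count table per class, plus class counts for the prior.

    Assumption: a unigram model replaces the class-conditional LSTM
    language model described in the abstract.
    """
    vocab = {word for doc in docs for word in doc}
    counts = {label: Counter() for label in set(labels)}
    for doc, label in zip(docs, labels):
        counts[label].update(doc)
    prior = Counter(labels)
    return vocab, counts, prior


def log_joint(doc, label, vocab, counts, prior):
    """Return log p(label) + sum over words of log p(word | label)."""
    total = sum(counts[label].values())
    score = math.log(prior[label] / sum(prior.values()))
    for word in doc:
        # Add-one smoothing; unseen words simply get count 0 (toy simplification).
        score += math.log((counts[label][word] + 1) / (total + len(vocab)))
    return score


def classify(doc, model):
    """Pick the label with the highest joint log-probability."""
    vocab, counts, prior = model
    return max(counts, key=lambda label: log_joint(doc, label, vocab, counts, prior))


model = train_generative(
    [["good", "great"], ["bad", "awful"]],
    ["pos", "neg"],
)
print(classify(["great"], model))
```

A discriminative model would instead fit p(y | x) directly. In the generative form above, a new class can be added by fitting one more class-conditional model without touching the existing ones, which is one intuition for the continual-learning setting the abstract mentions.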

arXiv.org e-Print Archive

Oxford University Research Archive

Learning Word Representations with Hierarchical Sparse Coding

Author: Dyer Chris
Faruqui Manaal
Smith Noah A.
Yogatama Dani
Publication date: 06/11/2014

We propose a new method for learning word representations using hierarchical regularization in sparse coding inspired by the linguistic study of word meanings. We show an efficient learning algorithm based on stochastic proximal methods that is significantly faster than previous approaches, making it possible to perform hierarchical sparse coding on a corpus of billions of word tokens. Experiments on various benchmark tasks (word similarity ranking, analogies, sentence completion, and sentiment analysis) demonstrate that the method outperforms or is competitive with state-of-the-art methods. Our word representations are available at http://www.ark.cs.cmu.edu/dyogatam/wordvecs/.
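As a rough illustration of what a proximal step with a group-structured penalty looks like, the sketch below takes a gradient step and then shrinks each group of coordinates by its L2 norm. It is a generic sketch, not the paper's algorithm: the penalty form, the group layout, and the names `prox_group_l2` and `proximal_step` are assumptions made for the example.

```python
import math


def prox_group_l2(values, threshold):
    """Group soft-thresholding: shrink the whole vector toward zero by its L2 norm."""
    norm = math.sqrt(sum(v * v for v in values))
    if norm <= threshold:
        return [0.0] * len(values)
    scale = 1.0 - threshold / norm
    return [scale * v for v in values]


def proximal_step(code, grad, groups, lr, lam):
    """One proximal gradient step: gradient update, then shrink each group in turn.

    Assumption: `groups` lists index groups in the order they should be applied.
    For nested (tree-structured) groups a common choice is smallest groups first.
    """
    z = [c - lr * g for c, g in zip(code, grad)]
    for group in groups:
        shrunk = prox_group_l2([z[i] for i in group], lr * lam)
        for i, value in zip(group, shrunk):
            z[i] = value
    return z


# Hypothetical nested groups: index 2 sits in all three groups, index 0 in only one.
updated = proximal_step(
    code=[1.0, 1.0, 1.0],
    grad=[0.0, 0.0, 0.0],
    groups=[[2], [1, 2], [0, 1, 2]],
    lr=1.0,
    lam=0.5,
)
print(updated)
```

In this toy setup, coordinates that belong to more groups are shrunk more, which is how a nested penalty can encode a hierarchy over dimensions.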

arXiv.org e-Print Archive

CiteSeerX

The Distributional Hypothesis Does Not Fully Explain the Benefits of Masked Language Model Pretraining

Author: Chiang Ting-Rui
Yogatama Dani
Publication date: 24/10/2023

We analyze the masked language modeling pretraining objective function from the perspective of the distributional hypothesis. We investigate whether the better sample efficiency and the better generalization capability of models pretrained with masked language modeling can be attributed to the semantic similarity encoded in the pretraining data's distributional property. Via a synthetic dataset, our analysis suggests that the distributional property indeed leads to the better sample efficiency of pretrained masked language models, but does not fully explain the generalization capability. We also conduct analyses over two real-world datasets and demonstrate that the distributional property does not explain the generalization ability of pretrained natural language models either. Our results illustrate our limited understanding of model pretraining and provide future research directions. Comment: EMNLP 202

arXiv.org e-Print Archive

Understanding In-Context Learning with a Pelican Soup Framework

Author: Chiang Ting-Rui
Yogatama Dani
Publication date: 15/02/2024

Many existing theoretical analyses of in-context learning for natural language processing are based on latent variable models that leave gaps between theory and practice. We aim to close these gaps by proposing a theoretical framework, the Pelican Soup Framework. In this framework, we introduce (1) the notion of a common sense knowledge base, (2) a general formalism for natural language classification tasks, and (3) the notion of meaning association. Under this framework, we can establish a $\mathcal{O}(1/T)$ loss bound for in-context learning, where $T$ is the number of example-label pairs in the demonstration. Compared with previous works, our bound reflects the effect of the choice of verbalizers and the effect of instruction tuning. An additional notion of atom concepts makes it possible for our framework to explain the generalization to tasks unseen in the language model training data. Finally, we propose a toy setup, Calcutec, and a digit addition task that mimics types of distribution shifts a model needs to overcome to perform in-context learning. We also experiment with GPT2-Large on real-world NLP tasks. Our empirical results demonstrate the efficacy of our framework to explain in-context learning.

arXiv.org e-Print Archive